Graphs are often the starting point for statistical analysis. One of the main advantages of R is how easy it is for the user to create many different kinds of graphs. We begin this chapter by studying conventional graphs, followed by an examination of some more complex representations. This final part uses the ggplot2 package.
To begin with, it may be interesting to look at a few examples of the graphical representations that can be constructed with R. We use the demo function:
demo(graphics)
The plot function is a generic function used to represent all kinds of data. Its classical use is to draw a scatterplot of a variable y against another variable x. For example, to draw the graph of the function \(x\mapsto \sin(x)\) on \([-2\pi,2\pi]\), evaluated at regular steps, we use the following commands:
x <- seq(-2*pi,2*pi,by=0.05)
y <- sin(x)
plot(x,y) #dot representation (default)
plot(x,y,type="l") #line representation
We provide examples of representations for quantitative and qualitative variables. We use the data file ozone.txt, imported using
path <- file.path("../DATA", "ozone.txt")
ozone <- read.table(path,stringsAsFactors=TRUE) #keep vent and pluie as factors (no longer the default since R 4.0)
summary(ozone)
## maxO3 T9 T12 T15
## Min. : 42.00 Min. :11.30 Min. :14.00 Min. :14.90
## 1st Qu.: 70.75 1st Qu.:16.20 1st Qu.:18.60 1st Qu.:19.27
## Median : 81.50 Median :17.80 Median :20.55 Median :22.05
## Mean : 90.30 Mean :18.36 Mean :21.53 Mean :22.63
## 3rd Qu.:106.00 3rd Qu.:19.93 3rd Qu.:23.55 3rd Qu.:25.40
## Max. :166.00 Max. :27.00 Max. :33.50 Max. :35.50
## Ne9 Ne12 Ne15 Vx9
## Min. :0.000 Min. :0.000 Min. :0.00 Min. :-7.8785
## 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:-3.2765
## Median :6.000 Median :5.000 Median :5.00 Median :-0.8660
## Mean :4.929 Mean :5.018 Mean :4.83 Mean :-1.2143
## 3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:7.00 3rd Qu.: 0.6946
## Max. :8.000 Max. :8.000 Max. :8.00 Max. : 5.1962
## Vx12 Vx15 maxO3v vent pluie
## Min. :-7.878 Min. :-9.000 Min. : 42.00 Est :10 Pluie:43
## 1st Qu.:-3.565 1st Qu.:-3.939 1st Qu.: 71.00 Nord :31 Sec :69
## Median :-1.879 Median :-1.550 Median : 82.50 Ouest:50
## Mean :-1.611 Mean :-1.691 Mean : 90.57 Sud :21
## 3rd Qu.: 0.000 3rd Qu.: 0.000 3rd Qu.:106.00
## Max. : 6.578 Max. : 5.000 Max. :166.00
Let us start by representing two quantitative variables: maximum ozone maxO3 according to temperature T12:
plot(ozone[,"T12"],ozone[,"maxO3"])
As the two variables are contained and named within the same table, a simpler syntax can be used, which automatically inserts the variables as labels for the axes:
plot(maxO3~T12,data=ozone)
We can also use the more verbose form, giving the axis labels explicitly:
plot(ozone[,"T12"],ozone[,"maxO3"],xlab="T12",ylab="maxO3")
The functions hist, barplot and boxplot draw the classical graphs:
hist(ozone$maxO3,main="Histogram")
barplot(table(ozone$vent)/nrow(ozone),col="blue")
boxplot(maxO3~vent,data=ozone)
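As a small sketch of what these calls compute, the snippet below uses the built-in mtcars data as a stand-in (an assumption, since ozone.txt may not be at hand). Dividing a table by the number of rows gives the same proportions as prop.table, and hist and boxplot can return their computed summaries without drawing when plot=FALSE.

```r
# mtcars ships with R; it stands in for the ozone data here (assumption).
counts <- table(mtcars$cyl)            # frequencies of each level
props  <- counts / nrow(mtcars)        # proportions, as in the barplot call above
props2 <- prop.table(counts)           # built-in equivalent

barplot(props, col = "blue")

# hist() and boxplot() can return their summaries without drawing.
h <- hist(mtcars$mpg, plot = FALSE)                   # breaks, counts, density, ...
b <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)  # one column of statistics per group
```

The objects h and b are handy for checking which bins or quartiles a graph is actually showing.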
We can use the rAmCharts package to obtain dynamic graphs. It is easy: we just add the prefix am before the name of the function:
library(rAmCharts)
amHist(ozone$maxO3)
amPlot(ozone,col=c("T9","T12"))
amBoxplot(maxO3~vent,data=ozone)
x <- seq(0,2*pi,length=1000)
plot(x,sin(x),type="l")
title("Plot of the sine function")
x <- seq(-4,4,by=0.01)
plot(x,dnorm(x),type="l")
abline(v=0,lty=2)
lines(x,dt(x,5),col=2)
lines(x,dt(x,30),col=3)
legend("topleft",legend=c("normal","Student(5)","Student(30)"),col=1:3,lty=1)
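We consider the ozone dataset. With the layout function, split the window into two rows: the scatterplot of maxO3 against T12 on the first row, then the histogram of T12 and the boxplot of maxO3 side by side on the second row.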
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
plot(maxO3~T12,data=ozone)
hist(ozone$T12)
boxplot(ozone$maxO3)
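To make the role of the matrix passed to layout more concrete, here is a minimal sketch. It uses the built-in mtcars data as a stand-in for ozone (an assumption); the matrix itself is the one used above.

```r
# The matrix assigns figure numbers to the cells of a 2 x 2 grid:
# figure 1 spans the whole first row, figures 2 and 3 share the second row.
m <- matrix(c(1, 1, 2, 3), nrow = 2, ncol = 2, byrow = TRUE)
print(m)

nf <- layout(m)                  # returns the number of figures (invisibly)
plot(mpg ~ wt, data = mtcars)    # figure 1 (mtcars is a stand-in dataset)
hist(mtcars$wt)                  # figure 2
boxplot(mtcars$mpg)              # figure 3

layout(matrix(1))                # restore a single plotting region
```

Repeating a figure number in adjacent cells is what makes a plot span several cells.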
ggplot2 is a plotting system for R based on the grammar of graphics (just as dplyr provides a grammar for manipulating data). Documentation is available on the ggplot2 website. We consider a subsample of the diamonds dataset from the ggplot2 package (loaded here through tidyverse):
library(tidyverse)
## ── Attaching packages ────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
set.seed(1234)
diamonds2 <- diamonds[sample(nrow(diamonds),5000),]
summary(diamonds2)
## carat cut color clarity depth
## Min. :0.2000 Fair : 158 D: 640 SI1 :1189 Min. :43.00
## 1st Qu.:0.4000 Good : 455 E: 916 VS2 :1157 1st Qu.:61.10
## Median :0.7000 Very Good:1094 F: 900 SI2 : 876 Median :61.80
## Mean :0.7969 Premium :1280 G:1018 VS1 : 738 Mean :61.76
## 3rd Qu.:1.0400 Ideal :2013 H: 775 VVS2 : 470 3rd Qu.:62.50
## Max. :4.1300 I: 481 VVS1 : 326 Max. :71.60
## J: 270 (Other): 244
## table price x y
## Min. :49.00 Min. : 365 Min. : 0.000 Min. :3.720
## 1st Qu.:56.00 1st Qu.: 945 1st Qu.: 4.720 1st Qu.:4.720
## Median :57.00 Median : 2376 Median : 5.690 Median :5.700
## Mean :57.43 Mean : 3917 Mean : 5.728 Mean :5.731
## 3rd Qu.:59.00 3rd Qu.: 5294 3rd Qu.: 6.530 3rd Qu.:6.520
## Max. :95.00 Max. :18757 Max. :10.000 Max. :9.850
##
## z
## Min. :0.000
## 1st Qu.:2.920
## Median :3.520
## Mean :3.538
## 3rd Qu.:4.030
## Max. :6.430
##
help(diamonds)
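Given a dataset, a graph is defined from several layers. We have to specify the data, the variables we want to represent, and the type of representation (scatterplot, histogram, barplot...).
ggplot graphs are built from these layers: we indicate the data with ggplot, the variables with aes, and the type of representation with a geom_ function.
With the base plot function, the scatterplot of price against carat is obtained with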
plot(price~carat,data=diamonds2)
With ggplot, we use
ggplot(diamonds2) #empty panel: no variables yet
ggplot(diamonds2)+aes(x=carat,y=price) #axes only: no geom yet
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point() #scatterplot
ggplot(diamonds2)+aes(x=carat)+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds2)+aes(x=carat)+geom_histogram(bins=10)
ggplot(diamonds2)+aes(x=cut)+geom_bar()
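The layered construction can be inspected directly. The sketch below builds a plot step by step and looks at the layers and mapping fields of the resulting objects; it uses the full diamonds data rather than the subsample, and relies on these fields being accessible with `$`, which is the case in the ggplot2 versions we are aware of.

```r
library(ggplot2)

p0 <- ggplot(diamonds)                      # data only
p1 <- p0 + aes(x = carat, y = price)        # data + variables
p2 <- p1 + geom_point()                     # data + variables + representation

n0 <- length(p0$layers)   # no layer yet, hence nothing is drawn
n1 <- length(p1$layers)   # axes are known, but there is still no geom
n2 <- length(p2$layers)   # the point layer has been added
vars <- names(p2$mapping) # aesthetics defined at the plot level
print(c(n0, n1, n2))
print(vars)
```

This is why the first two commands above produce no points: a plot needs at least one geom layer before anything is displayed.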
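In ggplot, the syntax is built from independent elements, which define the grammar of ggplot. The main elements of the grammar are the data (ggplot), the aesthetics (aes), the geometrics (geom_), the statistics (stat_), the scales (scale_), and the groups and facets (facet_).
All these elements are combined with a +.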
Data and aesthetics
These two elements specify the data and the variables we want to represent. For a scatterplot of price against carat we enter the command
ggplot(diamonds2)+aes(x=carat,y=price)
aes also accepts arguments such as color, size and fill. We use these arguments whenever a color or a size is defined from a variable of the dataset:
ggplot(diamonds2)+aes(x=carat,y=price,color=cut)
Geometrics
To obtain the graph, we need to specify the type of representation. We use geoms to do that. For a scatterplot, we use geom_point:
ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()
Observe that ggplot adds the legend automatically. Examples of geoms are described here:
| Geom | Description | Aesthetics |
|---|---|---|
| geom_point() | Scatter plot | x, y, shape, fill |
| geom_line() | Line (ordered according to x) | x, y, linetype |
| geom_abline() | Line | slope, intercept |
| geom_path() | Line (ordered according to the index) | x, y, linetype |
| geom_text() | Text | x, y, label, hjust, vjust |
| geom_rect() | Rectangle | xmin, xmax, ymin, ymax, fill, linetype |
| geom_polygon() | Polygon | x, y, fill, linetype |
| geom_segment() | Segment | x, y, xend, yend, linetype |
| geom_bar() | Barplot | x, fill, linetype, weight |
| geom_histogram() | Histogram | x, fill, linetype, weight |
| geom_boxplot() | Boxplots | x, y, fill, weight |
| geom_density() | Density | x, y, fill, linetype |
| geom_contour() | Contour lines | x, y, fill, linetype |
| geom_smooth() | Smoothers (linear or non linear) | x, y, fill, linetype |
| All | | color, size, group |
ggplot(diamonds2)+aes(x=cut)+geom_bar(fill="blue")
ggplot(diamonds2)+aes(x=cut,fill=cut)+geom_bar()
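The two commands above differ in an important way: the first sets fill to a constant, the second maps fill to a variable inside aes. A minimal sketch of the difference, using layer_data to look at the values ggplot2 computes for each bar (the full diamonds data is used here instead of the subsample):

```r
library(ggplot2)

p_set <- ggplot(diamonds) + aes(x = cut) + geom_bar(fill = "blue")
p_map <- ggplot(diamonds) + aes(x = cut, fill = cut) + geom_bar()

# layer_data() returns the data frame that the layer is drawn from.
fill_set <- unique(layer_data(p_set)$fill)   # a single constant colour
fill_map <- unique(layer_data(p_map)$fill)   # one colour per level of cut

print(fill_set)
print(length(fill_map))
```

Only the mapped version produces a legend, because only there does the colour carry information from the data.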
Statistics (this part can be omitted for beginners)
Many graphs need to transform the data to make the representation (barplot, histogram). Simple transformations can be obtained quickly. For instance we can draw the sine function with
D <- data.frame(X=seq(-2*pi,2*pi,by=0.01))
ggplot(D)+aes(x=X,y=sin(X))+geom_line()
The sine transformation is specified in aes. For more complex transformations, we have to use statistics. A stat function takes a dataset as input and returns a dataset as output, so a stat can add new variables to the original dataset. It is possible to map aesthetics to these new variables. For example, stat_bin, the statistic used to make histograms, produces the following variables:
- count, the number of observations in each bin
- density, the density of observations in each bin (percentage of total / bar width)
- x, the center of the bin

By default geom_histogram represents on the \(y\)-axis the number of observations in each bin (the output count).
ggplot(diamonds)+aes(x=price)+geom_histogram(bins=40)
For the density, we use
ggplot(diamonds)+aes(x=price,y=..density..)+geom_histogram(bins=40)
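The variables produced by stat_bin can be examined with layer_data, which returns the data frame computed for a layer. The following sketch checks the two properties described above: the counts add up to the number of observations, and the density bars integrate to one.

```r
library(ggplot2)

p <- ggplot(diamonds) + aes(x = price) + geom_histogram(bins = 40)
d <- layer_data(p)                 # one row per bin

print(head(d[, c("x", "count", "density")]))

width <- d$xmax - d$xmin           # bar width of each bin
total_count <- sum(d$count)        # total number of observations
total_area  <- sum(d$density * width)   # area under the density histogram
```

Here total_count equals nrow(diamonds) and total_area equals 1 up to rounding error.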
ggplot offers another way to build the representations: we can use stat_ instead of geom_. Formally, each stat function has a default geom and each geom has a default stat. For instance,
ggplot(diamonds2)+aes(x=carat,y=price)+geom_smooth(method="loess")
ggplot(diamonds2)+aes(x=carat,y=price)+stat_smooth(method="loess")
lead to the same graph. We can change the type of representation in the stat_ with the argument geom:
ggplot(diamonds2)+aes(x=carat,y=price)+stat_smooth(method="loess",geom="point")
Here are some examples of stat functions
| Stat | Description | Parameters |
|---|---|---|
| stat_identity() | No transformation | |
| stat_bin() | Count | binwidth, origin |
| stat_density() | Density | adjust, kernel |
| stat_smooth() | Smoother | method, se |
| stat_boxplot() | Boxplot | coef |
stat and geom are not always easy to combine. For beginners, we recommend using only geom.
We consider a color variable \(X\) with probability distribution \[P(X=red)=0.3,\ P(X=blue)=0.2,\ P(X=green)=0.4,\ P(X=black)=0.1\] Draw the barplot of this distribution.
X <- data.frame(X1=c("red","blue","green","black"),X2=c(0.3,0.2,0.4,0.1))
ggplot(X)+aes(x=X1,y=X2,fill=X1)+geom_bar(stat="identity")
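As a side note, ggplot2 (version 2.2.0 and later, to our knowledge) also provides geom_col, which is geom_bar with stat="identity" built in. The sketch below compares the two on the distribution of the exercise.

```r
library(ggplot2)

# Same distribution as in the exercise above.
X <- data.frame(X1 = c("red", "blue", "green", "black"),
                X2 = c(0.3, 0.2, 0.4, 0.1))

p_bar <- ggplot(X) + aes(x = X1, y = X2, fill = X1) + geom_bar(stat = "identity")
p_col <- ggplot(X) + aes(x = X1, y = X2, fill = X1) + geom_col()

# Both layers use the probabilities unchanged as bar heights.
h_bar <- layer_data(p_bar)$y
h_col <- layer_data(p_col)$y
print(rbind(h_bar, h_col))
```

Note that the bar colours come from the default discrete scale, not from the names stored in X1; making them match would require a manual scale.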
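Scales
Scales control the mapping from data to aesthetic attributes (change of colors, sizes…). We generally use this element at the end of the process to refine the graph. The name of a scale function starts with scale_, followed by the aesthetic we want to modify (color, fill, x...), followed by the name of the scale (manual, identity...).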
For instance,
ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()+
scale_color_manual(values=c("Fair"="black","Good"="yellow",
"Very Good"="blue","Premium"="red","Ideal"="green"))
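To check what a manual scale does, we can look at the colours that end up in the layer data. This is only a sketch, run on the full diamonds data rather than the subsample.

```r
library(ggplot2)

cols <- c("Fair" = "black", "Good" = "yellow", "Very Good" = "blue",
          "Premium" = "red", "Ideal" = "green")

p <- ggplot(diamonds) + aes(x = carat, y = price, color = cut) +
  geom_point() +
  scale_color_manual(values = cols)   # scale_ + aesthetic + type of scale

used <- unique(layer_data(p)$colour)  # colours actually assigned to the points
print(used)
```

Naming the elements of values, as done here, ties each colour to a level of cut regardless of the order in which they are listed.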
Here are the main scales:
| aes | Discrete | Continuous |
|---|---|---|
| Color (color and fill) | brewer | gradient |
| - | grey | gradient2 |
| - | hue | gradientn |
| - | identity | |
| - | manual | |
| Position (x and y) | discrete | continuous |
| - | date | |
| Shape | shape | |
| - | identity | |
| - | manual | |
| Size | identity | size |
| - | manual | |
Some examples:
Color of a barplot:
p1 <- ggplot(diamonds2)+aes(x=cut)+geom_bar(aes(fill=cut))
p1
We change colors by using the palette Purples:
p1+scale_fill_brewer(palette="Purples")
Gradient color for a scatterplot:
p2 <- ggplot(diamonds2)+aes(x=carat,y=price)+geom_point(aes(color=depth))
p2
We change the gradient color:
p2+scale_color_gradient(low="red",high="yellow")
Changes on the axes:
p2+scale_x_continuous(breaks=seq(0.5,3,by=0.5))+scale_y_continuous(name="prix")+scale_color_gradient("Profondeur")
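Group and facets
ggplot can draw representations for subgroups of individuals. We can proceed in two ways: represent the subgroups on the same graph with the group argument of aes, or represent them on separate graphs with facets.
We can represent (on the same graph) the smoother of price against carat for each level of cut with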
ggplot(diamonds2)+aes(x=carat,y=price,group=cut,color=cut)+geom_smooth(method="loess")
To obtain the representation on separate graphs, we use
ggplot(diamonds2)+aes(x=carat,y=price)+geom_smooth(method="loess")+facet_wrap(~cut)
ggplot(diamonds2)+aes(x=carat,y=price)+geom_smooth(method="loess")+facet_wrap(~cut,nrow=1)
facet_grid and facet_wrap do the same job but split the screen in different ways:
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point()+geom_smooth(method="lm")+facet_grid(color~cut)
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point()+geom_smooth(method="lm")+facet_wrap(color~cut)
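The difference between the two facet functions shows up in the panel layout that ggplot2 computes. The sketch below reads that layout from ggplot_build; the layout table with its ROW and COL columns is an internal structure, so treat this as an illustration under that assumption rather than a stable interface.

```r
library(ggplot2)

base <- ggplot(diamonds) + aes(x = carat, y = price) + geom_point()

# Panel layout tables computed by ggplot2 (one row per panel).
lay_grid <- ggplot_build(base + facet_grid(color ~ cut))$layout$layout
lay_wrap <- ggplot_build(base + facet_wrap(color ~ cut))$layout$layout

# facet_grid: one row per level of color, one column per level of cut.
dim_grid <- c(rows = max(lay_grid$ROW), cols = max(lay_grid$COL))
# facet_wrap: the panels are wrapped into a roughly square arrangement.
dim_wrap <- c(rows = max(lay_wrap$ROW), cols = max(lay_wrap$COL))
print(dim_grid)
print(dim_wrap)
```

facet_grid is therefore the natural choice when the rows and columns themselves should be readable as the levels of two variables.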
Syntax for ggplot is defined according to the following scheme:
ggplot()+aes()+geom_()+scale_()
The syntax is flexible: for instance, aes can also be specified inside ggplot or inside geom_:
ggplot(diamonds2)+aes(x=carat,y=price)+geom_point()
ggplot(diamonds2,aes(x=carat,y=price))+geom_point()
ggplot(diamonds2)+geom_point(aes(x=carat,y=price))
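These three forms draw the same scatterplot, but they do not store the mapping in the same place. A minimal sketch, on the full diamonds data:

```r
library(ggplot2)

p_a <- ggplot(diamonds) + aes(x = carat, y = price) + geom_point()
p_b <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
p_c <- ggplot(diamonds) + geom_point(aes(x = carat, y = price))

# Coordinates actually used by the point layer in each plot.
xy <- function(p) layer_data(p)[, c("x", "y")]
same_ab <- isTRUE(all.equal(xy(p_a), xy(p_b)))
same_ac <- isTRUE(all.equal(xy(p_a), xy(p_c)))

# Where the mapping lives: at the plot level (p_a, p_b) or in the layer (p_c).
n_map_a <- length(p_a$mapping)
n_map_c <- length(p_c$mapping)
print(c(same_ab, same_ac))
print(c(n_map_a, n_map_c))
```

The distinction matters once more layers are added: a mapping given inside geom_point is not inherited by later layers, whereas a plot-level mapping is.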
We can also build a graph from several datasets:
X <- seq(-2*pi,2*pi,by=0.001)
Y1 <- cos(X)
Y2 <- sin(X)
donnees1 <- data.frame(X,Y1)
donnees2 <- data.frame(X,Y2)
ggplot(donnees1)+geom_line(aes(x=X,y=Y1))+
geom_line(data=donnees2,aes(x=X,y=Y2),color="red")
ggplot provides many other functions, for instance themes that change the overall appearance of a graph:
p <- ggplot(diamonds2)+aes(x=carat,y=price,color=cut)+geom_point()
p+theme_bw()
p+theme_classic()
p+theme_grey()
p+theme_bw()
X <- seq(-2*pi,2*pi,by=0.001)
Y1 <- cos(X)
Y2 <- sin(X)
donnees1 <- data.frame(X,Y1)
donnees2 <- data.frame(X,Y2)
ggplot(donnees1)+geom_line(aes(x=X,y=Y1))+
geom_line(data=donnees2,aes(x=X,y=Y2),color="red")
donnees <- data.frame(X,Y1,Y2)
ggplot(donnees)+aes(x=X,y=Y1)+geom_line()+geom_line(aes(y=Y2),color="red")
df <- data.frame(X,cos=Y1,sin=Y2)
library(tidyverse)
#df1 <- melt(df,id.vars="X")
df1 <- gather(df,key="func",value="value",-X)
#or
df1 <- gather(df,key="func",value="value",cos,sin)
ggplot(df1)+aes(x=X,y=value,color=func)+geom_line()
ggplot(df1)+aes(x=X,y=value)+geom_line()+facet_wrap(~func)
library(gridExtra)
p1 <- ggplot(donnees1)+aes(x=X,y=Y1)+geom_line()
p2 <- ggplot(donnees2)+aes(x=X,y=Y2)+geom_line()
grid.arrange(p1,p2,nrow=1)
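In recent versions of tidyr, gather has been superseded by pivot_longer. The following sketch assumes tidyr 1.0.0 or later (the session shown earlier uses 0.8.3, where only gather is available) and reshapes the same cosine and sine data to long format.

```r
library(tidyr)
library(ggplot2)

X  <- seq(-2 * pi, 2 * pi, by = 0.001)
df <- data.frame(X, cos = cos(X), sin = sin(X))

# Wide -> long: one row per (X, function) pair.
df_long <- pivot_longer(df, cols = -X,
                        names_to = "func", values_to = "value")

ggplot(df_long) + aes(x = X, y = value, color = func) + geom_line()
```

The long format is what lets a single geom_line call draw both curves, with the color and the legend derived from func.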
We consider the dataset mtcars
data(mtcars)
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
ggplot(mtcars)+aes(x=mpg)+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mtcars)+aes(x=mpg)+geom_histogram(bins=10)
ggplot(mtcars)+aes(x=mpg,y=..density..)+geom_histogram(bins=10)
ggplot(mtcars)+aes(x=cyl)+geom_bar()
ggplot(mtcars)+aes(x=disp,y=mpg,color=cyl)+geom_point()
ggplot(mtcars)+aes(x=disp,y=mpg,color=as.factor(cyl))+geom_point()
ggplot(mtcars)+aes(x=disp,y=mpg,color=as.factor(cyl))+geom_point()+geom_smooth(method="lm")
df <- data.frame(X=seq(-2*pi,2*pi,by=0.01))
ggplot(df)+aes(x=X,y=sin(X))+geom_line()+geom_hline(yintercept=c(-1,1),color="blue",size=2)
n <- 100
X <- runif(n)
eps <- rnorm(n,sd=0.2)
Y <- 3+X+eps
D <- data.frame(X,Y)
model <- lm(Y~X,data=D) #simple linear regression of Y on X
co <- coef(model) #intercept and slope
D$fit <- predict(model) #fitted values, used below to draw the residuals
ggplot(D)+aes(x=X,y=Y)+geom_point()+geom_abline(slope=co[2],intercept=co[1],color="blue")
#2nd method
ggplot(D)+aes(x=X,y=Y)+geom_point()+geom_smooth(method="lm")
3. Draw the residuals: add a vertical line from each point to the linear smoother (use geom_segment).
ggplot(D)+aes(x=X,y=Y)+geom_point()+geom_smooth(method="lm")+geom_segment(aes(xend=X,yend=fit))
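The vertical segments drawn by geom_segment are exactly the residuals of the linear model. A self-contained sketch of that correspondence, with an arbitrary seed added only for reproducibility:

```r
library(ggplot2)

set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 100
X <- runif(n)
Y <- 3 + X + rnorm(n, sd = 0.2)
D <- data.frame(X, Y)

model <- lm(Y ~ X, data = D)
D$fit <- predict(model)           # fitted value for each observation
D$res <- D$Y - D$fit              # signed length of each vertical segment

ggplot(D) + aes(x = X, y = Y) +
  geom_point() +
  geom_abline(intercept = coef(model)[1], slope = coef(model)[2], color = "blue") +
  geom_segment(aes(xend = X, yend = fit), color = "grey50")
```

Each segment starts at the observed point (X, Y) and ends on the fitted line at (X, fit), so its signed length is the residual returned by residuals(model).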
We consider the diamonds dataset.
ggplot(data=diamonds) + geom_boxplot(aes(x=cut,y=carat,fill=cut))
ggplot(data=diamonds) + geom_boxplot(aes(x=cut,y=carat,fill=cut))+coord_flip()
ggplot(data=diamonds) + geom_density(aes(x=carat,y=..density..)) + facet_grid(cut~.)
Q1 <- diamonds %>% group_by(cut) %>% summarize(q1=quantile(carat,c(0.25)),q2=quantile(carat,c(0.5)),q3=quantile(carat,c(0.75)))
quantildf <- Q1%>% gather(key="alpha",value="quantiles",-cut)
ggplot(data=diamonds) + geom_density(aes(x=carat,y=..density..)) + facet_grid(cut~.) + geom_vline(data=quantildf,aes(xintercept=quantiles),col=alpha("black",1/2))
library(ggstance)
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
ggplot(data=diamonds) +
geom_boxploth(data=diamonds,aes(y=-0.5,x=carat,fill=cut)) +
geom_density(aes(x=carat,y=..density..)) + facet_grid(cut~.) +
geom_vline(data=quantildf,aes(xintercept=quantiles),col=alpha("black",1/2))
We consider the dataset on the four major tennis tournaments of 2013, studied in the previous sheet.
FrenchOpen_men_2013 <- read_csv("~/Dropbox/LAURENT/COURS/ENS_SPORT/DONNEES/FrenchOpen-men-2013.csv")
RG2013 <- FrenchOpen_men_2013
WIMB2013 <- read_csv("~/Dropbox/LAURENT/COURS/ENS_SPORT/DONNEES/Wimbledon-men-2013.csv")
RG_WIMB2013 <- bind_rows("RG"=RG2013,"WIMB"=WIMB2013,.id="Tournament")
RG_WIMB2013 %>% mutate(nb_aces=ACE.1+ACE.2) %>% ggplot()+aes(x=nb_aces,color=Tournament,fill=Tournament)+geom_histogram(bins=15)+facet_wrap(~Tournament,nrow=2)
RG_WIMB2013 %>% mutate(nb_aces=ACE.1+ACE.2) %>% ggplot()+aes(y=nb_aces,x=Tournament)+geom_boxplot()
RG_WIMB2013 %>% mutate(NPA=NPA.1+NPA.2) %>% ggplot()+aes(y=NPA,x=Tournament)+geom_boxplot()
df <- RG2013 %>% select(Player1,Player2,Result,FSP.1,FSP.2) %>% mutate(FSP.W=FSP.1*(Result==1)+FSP.2*(Result==0),FSP.L=FSP.1*(Result==0)+FSP.2*(Result==1))
df1 <- df %>% select(FSP.W,FSP.L) %>% gather(key=Result,value=FSP)
ggplot(df1)+aes(x=Result,y=FSP)+geom_boxplot()
df <- WIMB2013 %>% select(Player1,Player2,Result,FSP.1,FSP.2) %>% mutate(FSP.W=FSP.1*(Result==1)+FSP.2*(Result==0),FSP.L=FSP.1*(Result==0)+FSP.2*(Result==1))
df1 <- df %>% select(FSP.W,FSP.L) %>% gather(key=Result,value=FSP)
ggplot(df1)+aes(x=Result,y=FSP)+geom_boxplot()
df <- RG_WIMB2013 %>% select(Player1,Player2,Result,FSP.1,FSP.2,Tournament) %>% mutate(FSP.W=FSP.1*(Result==1)+FSP.2*(Result==0),FSP.L=FSP.1*(Result==0)+FSP.2*(Result==1))
ggplot(df)+aes(x=Tournament,y=FSP.W)+geom_boxplot()
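The computation of FSP.W and FSP.L above relies on multiplying by logical indicators. The sketch below reproduces it on a made-up toy table with the same column names and compares it with an ifelse formulation; it assumes, as the code above does, that Result equal to 1 means that the first player won.

```r
# Toy data with the same column names as the tennis files; the values are made up.
toy <- data.frame(Result = c(1, 0, 1),
                  FSP.1  = c(61, 58, 70),
                  FSP.2  = c(55, 64, 66))

# Indicator arithmetic, as in the text (Result == 1: player 1 is taken to be the winner).
toy$FSP.W <- toy$FSP.1 * (toy$Result == 1) + toy$FSP.2 * (toy$Result == 0)
toy$FSP.L <- toy$FSP.1 * (toy$Result == 0) + toy$FSP.2 * (toy$Result == 1)

# Equivalent formulation with ifelse, arguably easier to read.
toy$FSP.W2 <- ifelse(toy$Result == 1, toy$FSP.1, toy$FSP.2)
toy$FSP.L2 <- ifelse(toy$Result == 1, toy$FSP.2, toy$FSP.1)

print(toy)
```

Both formulations give the same winner and loser columns on this table.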